Problem statement: can we give a realistic estimate of someone's obesity level based purely on their eating habits and physical condition (without using their weight)?
For this, we will use this dataset, which represents data collected online.
We found some limitations of this dataset that you should keep in mind when looking at the results:
import pandas as pd
import numpy as np
import datetime
import scipy as sc
import random
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
df = pd.read_csv("ObesityDataSet.csv", sep=',')
df.head()
df.columns
df.info()
We can see that none of the columns contain missing values, which is good news: it shows the dataset is already relatively clean.
df.shape
df.describe()
display(df[df.duplicated()])
There are only 24 duplicated rows out of the original 2111, so we believe we can afford to remove them, even though they MIGHT actually represent different people who happen to share exactly the same characteristics.
Keeping the same individual multiple times would hurt our predictions given the algorithms we'll use, so we remove them.
Moreover, as noted earlier, 77% of this dataset was generated artificially, so it is quite possible that some, if not all, of these duplicated rows are synthetic copies of other rows.
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.info()
First, let's rename the columns so that each one is more explicit about what it represents.
For this, we use the variable descriptions found at https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub
df.columns = ['Gender', 'Age', 'Height', 'Weight', 'Family_history_with_overweight', 'Frequency_eat_high_caloric_food', 'Frequency_eat_vegetables', 'Number_of_main_meals',
'Frequency_eat_between_meals', 'Smoke', 'Frequency_water', 'Monitoring_calories_consumption', 'Frequency_physical_activity', 'Time_using_technology_devices',
'Frequency_alcohol', 'Main_transport', 'Obesity_level_category']
df.head()
Some of the columns contain qualitative data.
To reduce complexity and storage, we will replace these values with integers, but first let's check which columns this transformation applies to:
for col in df.columns:
    if df[col].dtype == 'object':
        print(f"{col} : {df[col].unique()}")
Now that we know which columns to take care of, let's encode all of the object-type columns and store their decoded labels in a dictionary.
df['Age'] = df['Age'].astype('uint8')
df['Height'] = df['Height'].round(2)
df['Weight'] = df['Weight'].round(1)
df.shape
We want to make the data model-friendly, which means encoding it numerically. We'll use one-hot encoding for the main transport category because there is no ordinal relationship between the different kinds of transport:
for example, nothing suggests that "Walking = Public transportation × 2". One-hot encoding creates a separate column for each value of the category and assigns it either 1 or 0.
For the remaining columns we use LabelEncoder, which keeps the dataset more compact by keeping all the possible values in a single column.
#Using OneHotEncoder for the Main_transport column
encoder = OneHotEncoder()
encoder.fit(pd.DataFrame(df['Main_transport']))
encoded_data = encoder.transform(pd.DataFrame(df['Main_transport']))
dense_encoded_data = encoded_data.toarray()
cols = [f'Main_transport_{col}' for col in encoder.categories_[0]]
encoded_df = pd.DataFrame(dense_encoded_data, columns=cols)
encoded_df = encoded_df.astype('uint8')
df = pd.concat([df.drop(columns=['Main_transport']), encoded_df], axis=1)
#Using LabelEncoder for the rest of the object type columns
encoder = LabelEncoder()
decoded_dict = {}
for col in df.columns:
    if df[col].dtype == 'object':
        encoded_labels = encoder.fit_transform(df[col])
        df[col] = encoded_labels
        decoded_dict[col] = {label: value for label, value in zip(encoded_labels, encoder.inverse_transform(encoded_labels))}
        df[col] = df[col].astype('uint8')
decoded_dict['Age'] = 'Numeric values'
decoded_dict['Height'] = 'Numeric values'
decoded_dict['Weight'] = 'Numeric values'
decoded_dict['MBI'] = 'Numeric values'
decoded_dict['Frequency_eat_vegetables'] = {1: 'Never', 2: 'Sometimes', 3: 'Always'}
decoded_dict['Number_of_main_meals'] = {1: '1/day', 2: '2/day', 3: '3/day', 4: '3+/day'}
decoded_dict['Frequency_water'] = {1: 'Less than 1L/day', 2: 'Between 1 and 2L/day', 3: 'More than 2L/day'}
decoded_dict['Frequency_physical_activity'] = {0: 'No activity', 1: '1 or 2 days/week', 2: '2 or 4 days/week', 3: '4+ days/week'}
decoded_dict['Time_using_technology_devices'] = {0: '0-2 hours/day', 1: '3-5 hours/day', 2: '5+ hours/day'}
decoded_dict['Main_transport_Automobile'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Bike'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Motorbike'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Public_Transportation'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Walking'] = {0: 'no', 1: 'yes'}
df.head()
for col in decoded_dict:
    print(f"{col} : {decoded_dict[col]}")
For reference, the 'Obesity_level_category' value is based on the Body Mass Index (BMI) formula, stored here in a column named MBI:
MBI = weight / height²
Here are the classifications:
df['MBI'] = df['Weight'] / df['Height']**2  # MBI: body mass index
for encoded_keys, category in decoded_dict['Obesity_level_category'].items():
    print(f"{category} min = {min(df[df['Obesity_level_category']==encoded_keys]['MBI'])}")
    print(f"{category} max = {max(df[df['Obesity_level_category']==encoded_keys]['MBI'])}")
Based on these numbers we can point out a few things :
decoded_dict['Obesity_level_category']
decoded_dict['Corrected_obesity_level_category'] = {0: 'Insufficient_Weight', 1: 'Normal_Weight', 2: 'Overweight', 3: 'Obesity_Type_I', 4: 'Obesity_Type_II', 5: 'Obesity_Type_III'}

def GetCategory(mbi):
    # Standard WHO BMI cut-offs
    if mbi < 18.5: return 0
    elif mbi < 25: return 1
    elif mbi < 30: return 2
    elif mbi < 35: return 3
    elif mbi < 40: return 4
    else: return 5
df['Corrected_obesity_level_category'] = df['MBI'].apply(GetCategory).astype('uint8')
df.head()
Let's check if everything is correct
for encoded_keys, category in decoded_dict['Corrected_obesity_level_category'].items():
    print(f"{category} min = {min(df[df['Corrected_obesity_level_category']==encoded_keys]['MBI'])}")
    print(f"{category} max = {max(df[df['Corrected_obesity_level_category']==encoded_keys]['MBI'])}")
Let's recap our new dataset:
df.info()
df.isnull().sum()
df.describe(include='all')
We start by writing a helper method that we will use later:
autolabel displays the value of each bar on top of it in bar plots, to make reading easier.
def autolabel(bars, myax=None):
    """Display the value of each bar above it, on the given axes (defaults to the global ax)."""
    target_ax = ax if myax is None else myax
    for bar in bars:
        height = bar.get_height()
        target_ax.annotate('{}'.format(height),
                           xy=(bar.get_x() + bar.get_width() / 2, height),
                           xytext=(0, 3),
                           textcoords="offset points",
                           ha='center', va='bottom')
Now for the visualisation
obesity_types_df = df['Corrected_obesity_level_category'].value_counts()
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))
obesity_types_df = obesity_types_df.reindex(lecture_order, fill_value=0)
n_types = len(lecture_order)
label = [name for key,name in decoded_dict['Corrected_obesity_level_category'].items()]
cmap = plt.get_cmap('RdYlGn')
color = cmap(np.linspace(0, 1, len(label))[::-1])
fig, ax = plt.subplots(figsize=(8, 6))
ax.pie(obesity_types_df, labels=label, colors=color, autopct='%1.1f%%',explode=[0.005]*6)
plt.title('Obesity types representation in our dataset')
plt.tight_layout()
plt.show()
The majority of people are at least overweight. This is significant, given that excess weight increases the risk of cancer, diabetes, heart disease and musculoskeletal problems.
This first graph shows why our study matters and that the problem is real.
obesity_types_df_men = df[df['Gender']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_women = df[df['Gender']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))
n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.35
fig, ax = plt.subplots(figsize=(9, 6))
bar_men = ax.bar(index, obesity_types_df_men, bar_width, color='dodgerblue', label='Men')
bar_women = ax.bar(index + bar_width, obesity_types_df_women, bar_width, color='violet', label='Women')
autolabel(bar_men)
autolabel(bar_women)
ax.set_title('Obesity types representation in our dataset by gender')
ax.set_xlabel('Obesity types')
ax.set_ylabel('Number of people from that category')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax.legend()
fig.tight_layout()
plt.show()
There seems to be a correlation between gender and obesity (shown later with the correlation matrix), so we'll keep this feature.
We also see that only one male in the dataset belongs to the Obesity_Type_III level. We'll see whether this is explained by a strong correlation between gender and obesity, meaning women tend to develop more severe obesity than men, or whether it is simply a sampling bias: our dataset may just not contain many men in that category.
obesity_types_df_young = df[df['Age']<20]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_adult = df[(df['Age']>=20) & (df['Age']<30)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_senior = df[(df['Age']>=30) & (df['Age']<40)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_old = df[df['Age']>=40]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))
n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.2
fig, ax = plt.subplots(figsize=(9, 6))
bar_young = ax.bar(index, obesity_types_df_young, bar_width, color='DarkBlue', label='0-19')
bar_adult = ax.bar(index + bar_width, obesity_types_df_adult, bar_width, color='royalblue', label='20-29')
bar_senior = ax.bar(index + 2*bar_width, obesity_types_df_senior, bar_width, color='DodgerBlue', label='30-39')
bar_old = ax.bar(index + 3*bar_width, obesity_types_df_old, bar_width, color='LightSkyBlue', label='40+')
autolabel(bar_young)
autolabel(bar_adult)
autolabel(bar_senior)
autolabel(bar_old)
ax.set_title('Obesity types representation in our dataset by age')
ax.set_xlabel('Obesity types')
ax.set_ylabel('Number of people from that category')
ax.set_xticks(index + bar_width * 1.5)
ax.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax.legend()
fig.tight_layout()
plt.show()
Obesity is rare at young ages, partly for morphological reasons, yet some young people are already obese. Consistently with this, the number of obese people increases drastically as the population ages.
obesity_types_df_young_men = df[(df['Age']<20) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_adult_men = df[(df['Age']>=20) & (df['Age']<30) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_senior_men = df[(df['Age']>=30) & (df['Age']<40) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_old_men = df[(df['Age']>=40) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_young_women = df[(df['Age']<20) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_adult_women = df[(df['Age']>=20) & (df['Age']<30) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_senior_women = df[(df['Age']>=30) & (df['Age']<40) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_old_women = df[(df['Age']>=40) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))
n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.2
fig = plt.figure(figsize=(16, 8))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
bar_young_men = ax1.bar(index, obesity_types_df_young_men, bar_width, color='aquamarine', label='0-19')
bar_adult_men = ax1.bar(index + bar_width, obesity_types_df_adult_men, bar_width, color='deepskyblue', label='20-29')
bar_senior_men = ax1.bar(index + 2*bar_width, obesity_types_df_senior_men, bar_width, color='royalblue', label='30-39')
bar_old_men = ax1.bar(index + 3*bar_width, obesity_types_df_old_men, bar_width, color='navy', label='40+')
autolabel(bar_young_men, ax1)
autolabel(bar_adult_men, ax1)
autolabel(bar_senior_men, ax1)
autolabel(bar_old_men, ax1)
ax1.set_title('Obesity types representation in our dataset by age for men')
ax1.set_xlabel('Obesity types')
ax1.set_ylabel('Number of people from that category')
ax1.set_xticks(index + bar_width * 1.5)
ax1.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax1.legend()
bar_young_women = ax2.bar(index, obesity_types_df_young_women, bar_width, color='lightpink', label='0-19')
bar_adult_women = ax2.bar(index + bar_width, obesity_types_df_adult_women, bar_width, color='magenta', label='20-29')
bar_senior_women = ax2.bar(index + 2*bar_width, obesity_types_df_senior_women, bar_width, color='crimson', label='30-39')
bar_old_women = ax2.bar(index + 3*bar_width, obesity_types_df_old_women, bar_width, color='maroon', label='40+')
autolabel(bar_young_women, ax2)
autolabel(bar_adult_women, ax2)
autolabel(bar_senior_women, ax2)
autolabel(bar_old_women, ax2)
ax2.set_title('Obesity types representation in our dataset by age for women')
ax2.set_xlabel('Obesity types')
ax2.set_ylabel('Number of people from that category')
ax2.set_xticks(index + bar_width * 1.5)
ax2.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax2.legend()
fig.tight_layout()
plt.show()
These graphs confirm what we observed and concluded earlier: the majority of people in this dataset are in their 20s, and most of them are at least Overweight for men and in Obesity_Type_III for women.
This confirms the importance of our subject for the upcoming generations.
obesity_types_df_family = df[df['Family_history_with_overweight']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_no_family = df[df['Family_history_with_overweight']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_hcf = df[df['Frequency_eat_high_caloric_food']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_no_eat_hcf = df[df['Frequency_eat_high_caloric_food']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_once_between_meals = df[df['Frequency_eat_between_meals']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_twice_between_meals = df[df['Frequency_eat_between_meals']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_thrice_between_meals = df[df['Frequency_eat_between_meals']==2]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_more_between_meals = df[df['Frequency_eat_between_meals']==3]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_never_eat_vegetables = df[df['Frequency_eat_vegetables']<2]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_often_eat_vegetables = df[(df['Frequency_eat_vegetables']>=2) & (df['Frequency_eat_vegetables']<3)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_always_eat_vegetables = df[df['Frequency_eat_vegetables']>=3]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))
n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.2
fig = plt.figure(figsize=(25, 22))
ax1 = fig.add_subplot(4,1,1)
ax2 = fig.add_subplot(4,1,2)
ax3 = fig.add_subplot(4,1,3)
ax4 = fig.add_subplot(4,1,4)
bar_family = ax1.bar(index, obesity_types_df_family, bar_width, color='darkorange', label='Family with obesity')
bar_no_family = ax1.bar(index + bar_width, obesity_types_df_no_family, bar_width, color='saddlebrown', label='Family without obesity')
autolabel(bar_family, ax1)
autolabel(bar_no_family, ax1)
ax1.set_title('Obesity types representation in our dataset depending on the history of obesity in the family')
ax1.set_xlabel('Obesity types')
ax1.set_ylabel('Number of people from that category')
ax1.set_xticks(index + bar_width / 2)
ax1.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax1.legend()
bar_hcf = ax2.bar(index, obesity_types_df_eat_hcf, bar_width, color='mediumspringgreen', label='High caloric food')
bar_no_hcf = ax2.bar(index + bar_width, obesity_types_df_no_eat_hcf, bar_width, color='lightseagreen', label='No high caloric food')
autolabel(bar_hcf, ax2)
autolabel(bar_no_hcf, ax2)
ax2.set_title('Obesity types representation in our dataset depending on if they eat high caloric food')
ax2.set_xlabel('Obesity types')
ax2.set_ylabel('Number of people from that category')
ax2.set_xticks(index + bar_width / 2)
ax2.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax2.legend()
# Use the decoded labels so the legend matches the encoded values
bar_once = ax3.bar(index, obesity_types_df_eat_once_between_meals, bar_width, color='greenyellow', label=decoded_dict['Frequency_eat_between_meals'][0])
bar_twice = ax3.bar(index + bar_width, obesity_types_df_eat_twice_between_meals, bar_width, color='lime', label=decoded_dict['Frequency_eat_between_meals'][1])
bar_thrice = ax3.bar(index + bar_width * 2, obesity_types_df_eat_thrice_between_meals, bar_width, color='forestgreen', label=decoded_dict['Frequency_eat_between_meals'][2])
bar_more = ax3.bar(index + bar_width * 3, obesity_types_df_eat_more_between_meals, bar_width, color='darkgreen', label=decoded_dict['Frequency_eat_between_meals'][3])
autolabel(bar_once, ax3)
autolabel(bar_twice, ax3)
autolabel(bar_thrice, ax3)
autolabel(bar_more, ax3)
ax3.set_title('Obesity types representation in our dataset depending on the frequency of eating between meals')
ax3.set_xlabel('Obesity types')
ax3.set_ylabel('Number of people from that category')
ax3.set_xticks(index + bar_width * 1.5)
ax3.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax3.legend()
bar_never = ax4.bar(index, obesity_types_df_never_eat_vegetables, bar_width, color='royalblue', label='Never')
bar_often = ax4.bar(index + bar_width, obesity_types_df_often_eat_vegetables, bar_width, color='blueviolet', label='Sometimes')
bar_always = ax4.bar(index + bar_width * 2, obesity_types_df_always_eat_vegetables, bar_width, color='darkmagenta', label='Always')
autolabel(bar_never, ax4)
autolabel(bar_often, ax4)
autolabel(bar_always, ax4)
ax4.set_title('Obesity types representation in our dataset depending on the frequency of eating vegetables')
ax4.set_xlabel('Obesity types')
ax4.set_ylabel('Number of people from that category')
ax4.set_xticks(index + bar_width)
ax4.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax4.legend()
fig.tight_layout()
plt.show()
These graphs reveal an unsettling fact.
ALL 264 rows categorized as Obesity_Type_III seem to share the SAME values across multiple columns, the one exception probably being the single male in that category, which is most likely real human data.
This is troubling because it means there is essentially no diversity within the most severe obesity type. One could argue that similar values appear because those features genuinely have a huge impact on an individual's obesity, but such uniformity at this scale still seems unlikely.
This is further proof that working with data artificially created by oversampling complicates our work: it introduces biases into our prediction model that we can only acknowledge and accept.
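One way to quantify this kind of uniformity is to flag, within a class, the rows whose lifestyle answers are exact copies of another row's. This is a minimal sketch with toy data and invented column names, not the notebook's actual columns:

```python
import pandas as pd

# Toy data: three Obesity_Type_III rows share identical lifestyle answers
toy = pd.DataFrame({
    "category": ["III", "III", "III", "I"],
    "veg":      [3, 3, 3, 2],
    "water":    [2, 2, 2, 1],
})
lifestyle_cols = ["veg", "water"]

# keep=False marks every member of a duplicate group, not just the repeats
dups = (toy[toy["category"] == "III"]
        .duplicated(subset=lifestyle_cols, keep=False))
print(dups.sum())  # 3: all Obesity_Type_III rows share the same answers
```

Applied to the real columns, a high count like this would confirm that the class is dominated by near-identical synthetic rows.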
plot = sns.pairplot(df,
hue ='Corrected_obesity_level_category',
vars = ['Gender','Age',
'Height', 'Weight'])
category_labels = {
'0': 'Insufficient_Weight',
'1': 'Normal_Weight',
'2': 'Overweight',
'3': 'Obesity_Type_I',
'4': 'Obesity_Type_II',
'5': 'Obesity_Type_III',
}
for category, label in zip(plot._legend_data.keys(), plot._legend.texts):
    if category in category_labels:
        label.set_text(category_labels[category])
plt.show()
This pairplot shows the distribution of our dataset over the four most basic descriptive features, giving a global view of the population represented in our work.
Next, we'll check whether some features are unusable because of their data distribution.
df.hist(figsize=(17, 20))
"Family history with overweight", "Frequency eat high caloric food", "Number of main meals", "Smoke" and "Monitoring calories consumption" have very concentrated values. So we will check whether these values have a real impact on the target (obesity type). If not, these features won't be necessary: rare, low-importance characteristics like these tend to make ML algorithms work less efficiently.
fig, ax = plt.subplots(5,1, figsize=(10,17))
ax[0].scatter(df["Number_of_main_meals"], df["MBI"])
ax[0].set_title("MBI = f(Number_of_main_meals)")
ax[1].scatter(df["Monitoring_calories_consumption"], df["MBI"])
ax[1].set_title("MBI = f(Monitoring_calories_consumption)")
ax[2].scatter(df["Family_history_with_overweight"], df["MBI"])
ax[2].set_title("MBI = f(Family_history_with_overweight)")
ax[3].scatter(df["Frequency_eat_high_caloric_food"], df["MBI"])
ax[3].set_title("MBI = f(Frequency_eat_high_caloric_food)")
ax[4].scatter(df["Smoke"], df["MBI"])
ax[4].set_title("MBI = f(Smoke)")
In every graph except Monitoring_calories_consumption and Family_history_with_overweight, we don't see a strong impact of the feature on the result. Given that these are rare events, we won't keep the features that show no impact.
Also, the distribution histogram of Frequency_water (above) looks continuous even though only four different answers were possible in the survey. We explain this by the oversampling that was applied. Below you'll see the proportion of normal (discrete) values versus abnormal ones.
However, these graphs are not the BEST way to show the correlation between the observed features and our target; we will try to find a better way later.
# Helper: True if a value has a non-zero decimal part
def is_decimal(value):
    return value % 1 != 0
# Count, for each row, how many of these survey columns contain a decimal value
decimal_counts = df['Frequency_water'].apply(is_decimal).astype(int)
decimal_counts += df['Number_of_main_meals'].apply(is_decimal)
decimal_counts += df['Frequency_eat_vegetables'].apply(is_decimal)
decimal_counts += df['Frequency_physical_activity'].apply(is_decimal)
decimal_counts += df['Time_using_technology_devices'].apply(is_decimal)
decimal_counts = decimal_counts.value_counts()
percentage_all_integer = (100 * decimal_counts[0]) / df.shape[0]
print(f"{decimal_counts[0]} rows have integer values in all of these columns, i.e. {percentage_all_integer:.2f}%")
print(decimal_counts)
This is consistent with 77% of the rows being artificial (since about 25% of rows have only discrete values, and 23% < 25%): in other words, it is possible that some of the artificially generated rows received integer values in all of the observed columns.
Fortunately, this shouldn't hurt our clustering results too badly (it should even help). We just have to keep in mind that our results will tend to overfit: because of the oversampling, we effectively overfit the original study sample, which is small (485 people).
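For intuition, here is a minimal sketch of SMOTE-style oversampling. This is an assumption about how the synthetic rows could have been produced, not the dataset authors' documented procedure: interpolating between integer-valued survey answers naturally produces the decimal values we observe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority class: integer survey answers on 1-3 scales
minority = rng.integers(1, 4, size=(10, 2)).astype(float)

def smote_like(samples, n_new, rng):
    """Create synthetic rows by interpolating between random pairs of
    existing rows, as SMOTE does between a point and a neighbour."""
    idx_a = rng.integers(0, len(samples), size=n_new)
    idx_b = rng.integers(0, len(samples), size=n_new)
    gap = rng.random((n_new, 1))  # random position along the segment
    return samples[idx_a] + gap * (samples[idx_b] - samples[idx_a])

synthetic = smote_like(minority, 20, rng)
# Interpolated values stay inside the original answer range [1, 3]
# but are generally no longer integers
print(synthetic.shape)
```

Real SMOTE interpolates only towards one of the k nearest minority neighbours; the simplification here (random pairs) is enough to show why oversampled survey answers end up with decimals.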
df_scope_1 = df[['Monitoring_calories_consumption', 'Frequency_eat_vegetables',
'Frequency_eat_between_meals', 'Frequency_water']]
df_scope_2 = df[['Frequency_physical_activity', 'Time_using_technology_devices',
'Frequency_alcohol', 'Obesity_level_category']]
plt.figure(figsize=(15, 3))
df_scope_2.boxplot(column=['Frequency_physical_activity', 'Time_using_technology_devices',
'Frequency_alcohol', 'Obesity_level_category'])
plt.show()
plt.figure(figsize=(15, 3))
df_scope_1.boxplot(column=['Monitoring_calories_consumption', 'Frequency_eat_vegetables',
'Frequency_eat_between_meals', 'Frequency_water'])
plt.show()
fig = plt.figure(figsize=(15,3))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.boxplot(df["Age"])
ax2.boxplot(df["Height"])
ax1.grid()
ax2.grid()
ax1.set_xlabel('Age')
ax2.set_xlabel('Height')
plt.show()
There are no relevant outliers to drop. For example, the age outliers are important for representing the population; the only issue they reveal is that the sample isn't representative of the whole population.
df[["Age", "Height", "Frequency_physical_activity", "Time_using_technology_devices"]].describe()
We notice a small standard deviation for Height; this is because the data has not been normalized yet, but the boxplots show the distribution isn't that problematic. Overall, we don't have standard-deviation problems.
To sum up, we'll keep all of the features we have for the moment.
Before dropping anything, we will check the correlation for each of them to make sure we're making the right choice.
corr = df.corr()
f, ax = plt.subplots(figsize=(16, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.color_palette("YlOrBr", as_cmap=True)
sns.heatmap(round(corr,2), mask=mask, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5},
annot = True, annot_kws={"size": 8})
highly_correlated_features = []
threshold = 0.7
for i in range(len(corr.columns)):
    for j in range(i):
        if abs(corr.iloc[i, j]) > threshold:
            colname = corr.columns[i]
            rowname = corr.index[j]
            highly_correlated_features.append((colname, rowname))
highly_correlated_features
Because we aim to predict obesity level based on lifestyle criteria and without weight (and therefore without MBI), we can drop both for the prediction.
'Main_transport_Public_Transportation' and 'Main_transport_Automobile' are also highly correlated; let's take a deeper look.
transport_corr = df[["Main_transport_Public_Transportation", "Main_transport_Automobile"]]
transport_corr_repartition = transport_corr.groupby(["Main_transport_Public_Transportation", "Main_transport_Automobile"]).size().reset_index(name="count")
def transport_car_or_bus(row):
    if row["Main_transport_Public_Transportation"] and row["Main_transport_Automobile"]:
        return "Automobile & Public transport"
    elif row["Main_transport_Public_Transportation"]:
        return "Public transport"
    elif row["Main_transport_Automobile"]:
        return "Automobile"
    return "Other"
transport_corr_repartition["transport"] = transport_corr_repartition.apply(transport_car_or_bus, axis=1)
transport_corr_repartition.rename(columns={0:"count"}, inplace=True)
colors = ['gold','cornsilk','goldenrod']
plt.pie(transport_corr_repartition["count"], labels=transport_corr_repartition["transport"], colors = colors)
plt.title('Pie chart of Automobile and Public transport distribution in our dataset')
plt.tight_layout()
plt.show()
Public transport and automobile are indeed highly (negatively) correlated: for most rows, exactly one of the two is 1, since the "Other" category represents only a minority.
contingency_table = pd.crosstab(transport_corr_repartition['Main_transport_Automobile'], transport_corr_repartition['Main_transport_Public_Transportation'], values=transport_corr_repartition['count'], aggfunc='sum', margins=True, margins_name='Total')
plt.figure(figsize=(8, 6))
sns.heatmap(contingency_table, annot=True, cmap="Blues", fmt='g', cbar=True)
plt.title('Double Entry Table - Automobiles vs Public transport')
plt.show()
As we can see, there are few 0-0 couples (the False-False square), and, as said above, many (0,1) and (1,0) couples, because few people answered anything other than Public transport or Automobile.
plt.pie([49,28,23], labels =["Public transport", "Automobile", "Bike or walking"],
colors = ['thistle','plum','mediumpurple'])
plt.title("Mexico City's urban areas transportation in 2014")
plt.tight_layout()
plt.show()
The pie chart above shows the breakdown of transportation in Mexico City's urban areas in 2014, whereas in our dataset a vast majority of respondents report using public transportation only.
This is surprising given that the question asked was "Which transportation do you usually use?", and public transportation use in Mexico City is expected to be higher than in our dataset's broader region.
We could argue that this is because survey respondents had to choose a single means of transport, but the difference is too big to be explained by that reason alone.
We'll nevertheless keep this feature, knowing that our dataset is globally messy (as we feared when first describing it).
To sum up, here are the features we'll remove (for the reasons above): Weight, MBI, Obesity_level_category, Number_of_main_meals, Smoke, and all the Main_transport columns.
df.describe()
sorted_corr = corr['Corrected_obesity_level_category'].abs().sort_values()
print(sorted_corr)
Based on the correlation heatmap we made earlier, we can directly see which variables have the strongest correlation with "Corrected_obesity_level_category".
However, we will follow our previous analysis and drop some columns based on the work done in part III.
Since we want to predict the obesity type, and we have access to it through the "Corrected_obesity_level_category" column, we'll do supervised learning.
Let's build a quick pipeline to compare different algorithms.
Y = df['Corrected_obesity_level_category']
X = df.drop(columns = ['Corrected_obesity_level_category', 'MBI', 'Obesity_level_category', 'Weight','Number_of_main_meals', 'Main_transport_Bike', 'Main_transport_Automobile', 'Main_transport_Motorbike','Main_transport_Public_Transportation','Main_transport_Walking', 'Smoke'])
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=1, stratify = Y)
First we split the dataset into X (our features) and Y (what we want to predict), then into training and testing sets.
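As a quick illustration on a toy array (not our dataset), `stratify` keeps the class proportions identical in both splits, which matters here because some obesity categories are much rarer than others:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 of class 0, 20 of class 1 (an 80/20 imbalance)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy)

# Both splits keep the 80/20 class ratio
print((y_tr == 1).mean())  # 0.2
print((y_te == 1).mean())  # 0.2
```

Without `stratify`, a random split could leave a rare class almost absent from the test set.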
ss = StandardScaler()
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)
mm = MinMaxScaler()
x_train_mm_scaled = mm.fit_transform(x_train)
x_test_mm_scaled = mm.transform(x_test)
We're going to test the algorithms with two different scalers.
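To see the difference between the two scalers on a toy column (illustrative values, not our data): StandardScaler centres on the mean with unit variance, while MinMaxScaler squeezes everything into [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A toy column to compare the two scalers
col = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

ss_out = StandardScaler().fit_transform(col)  # mean 0, std 1
mm_out = MinMaxScaler().fit_transform(col)    # mapped into [0, 1]

print(ss_out.ravel().round(2))  # [-1.41 -0.71  0.    0.71  1.41]
print(mm_out.ravel())           # [0.   0.25 0.5  0.75 1.  ]
```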
pipeline = [
('Random Forest', RandomForestClassifier()),
('Decision Tree', DecisionTreeClassifier()),
('KNN', KNeighborsClassifier()),
('SVM', SVC())
]
We're testing 4 different classification algorithms.
def model_predict(x_train, y_train, x_test, y_test):
    for name, model in pipeline:
        # Unscaled data
        clf = model.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        # StandardScaler data
        clf_ss = model.fit(x_train_scaled, y_train)
        y_pred_ss_scaled = clf_ss.predict(x_test_scaled)
        # MinMaxScaler data
        clf_mm = model.fit(x_train_mm_scaled, y_train)
        y_pred_mm_scaled = clf_mm.predict(x_test_mm_scaled)

        accuracy = round(accuracy_score(y_test, y_pred), 5)
        scaled_ss_accuracy = round(accuracy_score(y_test, y_pred_ss_scaled), 5)
        scaled_mm_accuracy = round(accuracy_score(y_test, y_pred_mm_scaled), 5)

        print(name + ':')
        print("---------------------------------------------------------------")
        print("Accuracy:", accuracy)
        print("Accuracy w/ StandardScaler data (ss):", scaled_ss_accuracy)
        print("Accuracy w/ MinMaxScaler data (mm):", scaled_mm_accuracy)

        # Print the classification report for the best of the three variants
        if (accuracy > scaled_ss_accuracy) and (accuracy > scaled_mm_accuracy):
            print("\nClassification Report:\n", classification_report(y_test, y_pred))
        elif scaled_ss_accuracy > scaled_mm_accuracy:
            print("\nClassification Report (ss):\n", classification_report(y_test, y_pred_ss_scaled))
        else:
            print("\nClassification Report (mm):\n", classification_report(y_test, y_pred_mm_scaled))
        print(" ----------------------------------- \n")
model_predict(x_train,y_train,x_test,y_test)
So as we can see, the random forest classifier seems to be the most accurate. Let's give it a closer look.
A random forest classifier essentially builds many decision trees and then combines their votes to decide. It's really effective for classification problems such as this one. Each decision tree considers only a random subset of the features at each split, which creates more variation among the trees.
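That per-tree feature subsampling is controlled by the `max_features` parameter; here is a minimal sketch on synthetic data (the dataset and parameter values are illustrative, not tuned for our problem):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data just to show the mechanism
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# With max_features="sqrt", each split considers only sqrt(8) ~ 3 features,
# decorrelating the 50 trees in the ensemble
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
rf.fit(X_demo, y_demo)
print(len(rf.estimators_))  # 50
```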
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=2, stratify = Y)
scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
Now we would like to know what parameters are optimal for our random forest classifier.
param_grid = {
'n_estimators': [5, 10, 50, 100],
'max_depth': [3, 10, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
rf_classifier = RandomForestClassifier()
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train_scaled, y_train)
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(x_test_scaled)
accuracy_best = accuracy_score(y_test, y_pred_best)
print("Best Parameters:", best_params)
print("Best Accuracy on Test Set:", accuracy_best)
GridSearchCV tries every combination of the listed parameters and returns the most effective one. Here we can see that we obtain the best accuracy with the default value of every parameter.
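If we wanted to look beyond the single best combination, `GridSearchCV` also exposes the scores of every combination in its `cv_results_` attribute; a small sketch on synthetic data (not our dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=150, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [5, 20]}, cv=3, scoring="accuracy")
grid.fit(X_demo, y_demo)

# One row per parameter combination, with its mean cross-validated score
results = pd.DataFrame(grid.cv_results_)[["param_n_estimators", "mean_test_score"]]
print(results)
```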
Let's apply our algorithm then:
scores = []
for _ in range(25):
    rfc = RandomForestClassifier()
    rfc.fit(x_train_scaled, y_train)
    scores.append(rfc.score(x_test_scaled, y_test))
print(max(scores))
print(np.mean(scores))
print(min(scores))
We can also use cross-validation: it trains the algorithm several times on different parts of the dataset in order to reduce the risk of overfitting.
rfc = RandomForestClassifier()
scores = cross_val_score(rfc, x_train_scaled, y_train, cv=10)
rfc.fit(x_train_scaled, y_train)
print("Cross-validation scores:", scores)
print("Mean cross-validation score:", scores.mean())
We can see our random forest classifier seems to do well. We will try it later on ourselves, to see whether the algorithm works on new data!
features = np.concatenate((x_train_scaled, x_test_scaled), axis=0)
importances = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
cmap = plt.get_cmap('cool')
colors = cmap(np.linspace(0, 1, len(indices)))
plt.figure()
plt.title("Feature Importances")
plt.bar(range(features.shape[1]), importances[indices],
color=colors, yerr=std[indices], align="center")
plt.xticks(range(features.shape[1]), X.columns[indices], rotation='vertical')
plt.xlim([-1, features.shape[1]])
plt.show()
x_train.columns
This graph allows us to see which variables are the most impactful for predicting the obesity type. 6 columns seem quite effective:
importances_sorted = np.sort(importances)[::-1]
data_importances = {
    'Feature': [X.columns[i] for i in indices],
    'Importance': [round(importances_sorted[i]*100, 2) for i in range(len(indices))]
}
importance_df = pd.DataFrame(data_importances['Importance'], index=data_importances['Feature'])
importance_df.columns = ['% of importance']
importance_df
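Impurity-based importances like the ones above can be cross-checked with permutation importance, which measures how much the score drops when a single column is shuffled; a hedged sketch on synthetic data (not a claim about our dataset's features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestClassifier(random_state=0).fit(Xtr, ytr)

# Shuffle each column 10 times and record the mean drop in test accuracy
perm = permutation_importance(model, Xte, yte, n_repeats=10, random_state=0)
print(perm.importances_mean.round(3))  # one score per feature
```

Unlike impurity importance, this is computed on held-out data, so it is less biased toward high-cardinality features.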
rfc = RandomForestClassifier()
rfc.fit(x_train_scaled, y_train)
y_pred = rfc.predict(x_test_scaled)
print(rfc.score(x_test_scaled, y_test))
y_test_np = np.array(y_test)
# Print the results
print("Predicted Labels:")
print(y_pred)
print("\nTrue Labels:")
print(y_test_np)
Let's have a look at the confusion matrix:
cm = confusion_matrix(y_test_np, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=decoded_dict['Corrected_obesity_level_category'].values(), yticklabels=decoded_dict['Corrected_obesity_level_category'].values())
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Another default parameter we didn't change is bootstrap=True. With bootstrapping, each tree is trained on a sample of the original dataset drawn at random with replacement (the same size as the original, so some rows repeat and others are left out). This helps avoid overfitting, as every tree sees a slightly different dataset.
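To make the idea concrete, here is a toy bootstrap resample (illustrative only; inside the forest, scikit-learn does this once per tree):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy "training set" of 10 rows

# Sampling with replacement: same size as the original, but some rows
# repeat and others are left out, so each tree sees a different dataset
sample = rng.choice(data, size=len(data), replace=True)
print(sample)
```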
average_df = df.drop(columns = ['Corrected_obesity_level_category', 'MBI', 'Obesity_level_category', 'Weight','Number_of_main_meals', 'Main_transport_Bike', 'Main_transport_Automobile', 'Main_transport_Motorbike','Main_transport_Public_Transportation','Main_transport_Walking', 'Smoke'])
average_people = {col: [] for col in average_df}
for col in average_df.columns:
    average_people[col].append(round(np.mean(df[df['Gender'] == 1][col])))
    average_people[col].append(round(np.mean(df[df['Gender'] == 0][col])))
average_people_df = pd.DataFrame(average_people)
average_people_df_scaled = scaler.transform(average_people_df)
average_people_prediction = rfc.predict(average_people_df_scaled)
genders = ['Average man', 'Average woman']
results = [decoded_dict['Corrected_obesity_level_category'][result] for result in average_people_prediction]
matrix_results = np.array([genders,results])
matrix_results
Therefore, exclusively from the data contained in this dataset, we can deduce that the average man (computed over all the features except MBI, Weight, and the obesity type) probably belongs to the Normal_Weight category, whereas the average woman is predicted to belong to the Obesity_Type_III category (this sometimes varies, depending on the random factor each time we run our model).
This analysis very clearly shows the misrepresentation we highlighted earlier, as categorizing the average woman in Obesity_Type_III (the highest one) is completely unrealistic and unrepresentative of the global population. This is largely explained by the fact that out of the 264 people categorized in that obesity type, 263 are women, which gives WAY too much importance to that feature, much more than it should.
However, it is important to note that the results often change as we refit the algorithm; this result is the one that occurred most often over many tries.
Just out of curiosity, we will test the accuracy of our algorithm on ourselves, the members of this project:
for col in decoded_dict:
    print(f"{col} : {decoded_dict[col]}")
This is simply to help us know which value we should assign to ourselves.
group_members = {'Gender': [1,1,1], 'Age': [21,20,21], 'Height': [1.80,1.76,1.74], 'Family_history_with_overweight': [1,0,0],
'Frequency_eat_high_caloric_food': [1,1,1], 'Frequency_eat_vegetables': [2,2,2], 'Frequency_eat_between_meals': [1,1,2],
'Frequency_water': [2,2,2], 'Monitoring_calories_consumption': [0,0,1], 'Frequency_physical_activity': [2,1,1],
'Time_using_technology_devices': [2,2,2], 'Frequency_alcohol': [3,2,2]}
group_members_df = pd.DataFrame(group_members)
group_members_df_scaled = scaler.transform(group_members_df)
group_members_prediction = rfc.predict(group_members_df_scaled)
members = ['Louis', 'Killian', 'Marc']
predicted_results = [decoded_dict['Corrected_obesity_level_category'][result] for result in group_members_prediction]
matrix_predicted_results = np.array([members,predicted_results])
print(matrix_predicted_results)
members_weight = [67.0, 59.0, 65.0]
real_MBI = [members_weight[i]/group_members['Height'][i]**2 for i in range(len(members))]
real_results = [decoded_dict['Corrected_obesity_level_category'][GetCategory(mbi)] for mbi in real_MBI]
matrix_real_results = np.array([members,real_MBI,real_results])
print(matrix_real_results)
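As a sanity check on the BMI arithmetic above (assuming `GetCategory`, defined earlier, follows the standard cut-offs of 18.5 and 25 around the Normal_Weight band), we can recompute weight / height² by hand:

```python
# Re-computation of the members' BMI (weight / height^2); the 18.5-25
# Normal_Weight band is the standard cut-off we assume GetCategory uses
members_weight = [67.0, 59.0, 65.0]
members_height = [1.80, 1.76, 1.74]

bmi = [w / h**2 for w, h in zip(members_weight, members_height)]
print([round(b, 2) for b in bmi])  # [20.68, 19.05, 21.47]

# All three fall in the 18.5-25 "Normal_Weight" band
print(all(18.5 <= b < 25 for b in bmi))  # True
```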
Therefore, we now have a model capable of predicting the obesity type of a person with reasonable accuracy, based on all of the features presented in this dataset (except for the Weight column, which gives away too much information).
Based on our tests and analyses from earlier, we can conclude that our algorithm must have a pretty good basis since all of the predictions applied to our group ended up being correct.
However, we know that it is still very open to errors: many columns bias our model's predictions more than those features should realistically matter, because a huge part of our data has been artificially generated by oversampling, which leads to overfitting on certain features.
For example, there are precisely 264 rows that belong to the Obesity_Type_III, which is the most dangerous type in terms of health, but they are extremely poorly distributed.
Indeed, out of these 264 rows, only one is male. Even though this could theoretically be plausible, it seems unlikely that the group of people surveyed had such a skewed representation within that obesity category. And this is only for the Gender column; many other features have the same problem.
In the end, oversampling the dataset to get a larger and broader view over the repartition of people belonging to each category was an interesting idea, but was executed poorly, and gave too much of a bias to the dataset.
That being said, with our model now complete and functional, we could apply it to any person given they fill all of their personal data for each of the features required, but we could not find another dataset with these exact same columns, so we only applied it to the members of our group to prove its efficiency.
After multiple tries, we found out that our model always predicts the right category for Killian, but for Louis it seems to switch between Insufficient_Weight and Normal_Weight each time we generate a new model, and for Marc it switches between Normal_Weight and Overweight, so a small amount of uncertainty is still clearly visible.
However, our group lacks diversity since all three of us belong in the Normal_Weight category, and we also have similar eating habits and physical conditions. It would have been interesting to test our algorithm on people with very different habits from ours, to see how our model performs on a wider range of subjects.
If you now wish to test our model personally or with fictional values, you can try our API application and fill in the form with all of the data; you'll find out which obesity type our model predicts for you.